Journal of Computational Biology — Latest Matching Preprints

1

Towards a Unified Exact Solution of Rearrangement Small Parsimony for Natural Genomes

Bohnenkaemper, L.; Frolova, D.

2026-06-28 bioinformatics 10.64898/2026.06.23.733974 medRxiv

Top 0.1%

19.0%

Phylogenetic reconstruction is a fundamental problem in comparative genomics. As a theoretical problem in rearrangement studies, this has been modelled as the Small Parsimony Problem (SPP), in which ancestral genome structures have to be determined so as to minimize the number of rearrangement events occurring throughout the phylogeny. This problem is of significant interest in microbial and cancer genomics, due to the prevalence and clinical importance of rearrangement events. Genome structures in this problem are expressed as sequences of markers, which are themselves oriented sequence features (such as genes) that abstract from non-structural variations. Recent research has focused on the problem under the natural genomes model, in which arbitrary variations in copy number of markers are allowed. Natural genomes are often studied under the DCJ-indel model, a model which has already been successfully applied to plasmid data. There also exist ILP solutions to a variant of the Small Parsimony Problem under the DCJ-indel model. However, these solutions are limited in their applicability, as they make some critical simplifications for tractability purposes: ancestral marker frequencies and precomputed putative ancestral adjacencies, with their predicted likelihoods, are assumed as input. This creates multiple problems from both a theoretical and practical perspective. Firstly, this simplification means that the full state space is not searched for a solution, but only the subset of genomes with the precomputed putative adjacencies, so an optimal solution to the exact SPP is not guaranteed. Secondly, marker frequencies are given externally, without any theoretical guarantees. Thirdly, the method used to precompute adjacencies relies on gene trees, which requires the use of genes as markers, when gene annotation is often unreliable, especially in regions with a lot of rearrangement.
Additionally, this restricts the applicability of the approach to sets of genomes that are both divergent and large enough to produce informative gene trees. This is, for example, rarely the case for plasmids, where nucleotide mutations are rarer than rearrangements and genomes are small. Hence, we revisit the problem to solve the exact SPP by introducing a cost for indel operations, which allows us to compute ranges of marker frequencies and derive theoretical results that reduce the solution space the ILP searches without sacrificing optimality. We show that this makes the problem tractable for the case of small and recently related genomes, first on simulated genomes, and then on a set of pathogenic plasmids which represent a realistic use case for the method.

2

Extended t-cores for the de novo identification of transposable elements and other inexact repeats from short read RNAseq data

Darmon, S.; Mary, A.; Lacroix, V.

2026-07-10 bioinformatics 10.64898/2026.07.06.736737 medRxiv

Top 0.1%

10.9%

Transcribed repeats represent a major challenge in the de novo assembly of transcriptomes from short RNA-seq reads. Young transposable elements (TEs) and other inexact repeats create dense and ambiguous regions in the assembly graph, preventing the correct assembly of transcripts. In this paper, we introduce a fully de novo method based on the discovery of dense regions in the compacted De Bruijn graph (DBG) to identify such repeats directly from short-read RNA-seq data, without requiring a reference genome or repeat database. Our approach defines the extended t-cores, subgraphs of the DBG that capture the complex topology induced by highly expressed inexact repeats appearing in RNA-seq reads. Independently of its interest for transcriptome assembly, the proposed method appears to be effective for the de novo identification of repeats in transcriptomes. After classifying cores using sequence-based motifs to distinguish simple repeats from potential TEs, we demonstrate its potential for the de novo discovery of transposable elements. We validate the approach on a Mus musculus dataset using expressed TE consensus sequences, showing that extended t-cores correspond to known expressed TE families. We also illustrate its de novo discovery potential on a non-model species, Canis lupus familiaris, where the method was also able to recover known transposable elements.

3

BertST: BERT-based Spatial Domain Identification in Patient Data

Nnadi, G. O.

2026-07-09 bioinformatics 10.64898/2026.07.04.736527 medRxiv

Top 0.1%

6.7%

Spatial transcriptomics enables the study of gene expression within its native tissue context, providing critical insights into cellular organization and microenvironment-driven biological processes. A key challenge in this field is spatial domain identification, which aims to partition tissue into coherent regions by jointly leveraging gene expression and spatial information. Existing approaches are predominantly based on Graph Neural Networks (GNNs), while approaches based on Transformers, particularly the Bidirectional Encoder Representations from Transformers (BERT) model, for modelling both local and long-range dependencies remain largely unexplored. In this work, we propose BERT for Spatial Transcriptomics (BertST), a transformer-based framework that reformulates spatial transcriptomics as a graph-to-text representation learning problem. Building upon the BERTwalk paradigm, we construct a task-specific multi-graph representation integrating spatial adjacency, pruned gene-expression similarity, and a fully connected gene-expression graph. This design enables the modelling of both local spatial structure and global molecular relationships. Random walks over these graphs are treated as sequences, allowing a BERT model to learn contextualised node embeddings. To further enhance representation quality, we introduce a hierarchical multi-graph propagation strategy, where embedding refinement is performed sequentially: first on the fully connected graph to capture global structure, followed by the pruned graph to refine molecular relationships, and finally on the spatial graph to enforce local smoothness. This ordering ensures that global information is effectively distributed and progressively constrained by biologically meaningful neighbourhoods. We also improve computational efficiency by leveraging PecanPy, a fast and scalable implementation of node2vec, enabling efficient random walk generation on dense graphs.
Experimental results on multiple 10x Visium datasets, including DLPFC and Human Breast Cancer, demonstrate that BertST consistently outperforms or matches GNN-based methods such as ConST, CCST, and SpaceFlow in terms of Adjusted Rand Index (ARI) and Adjusted Mutual Information (AMI). Overall, BertST highlights the potential of transformer-based architectures for spatial omics analysis by effectively capturing both local and long-range spatial-molecular dependencies, offering a promising alternative to traditional graph-based methods.
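
The step "random walks over these graphs are treated as sequences" can be illustrated with a minimal sketch. It assumes a plain adjacency-list graph and uniform neighbour sampling; it is not the node2vec-biased walk that PecanPy implements, and the function name `random_walks`, the spot labels, and the parameters are hypothetical.

```python
import random


def random_walks(adjacency, walk_length, walks_per_node, seed=0):
    """Generate uniform random walks; each walk is a 'sentence' of node ids.

    adjacency: dict mapping a node id to a list of neighbouring node ids.
    Uniform sampling is an illustrative assumption, not the paper's scheme.
    """
    rng = random.Random(seed)  # seeded generator for reproducible walks
    walks = []
    for start in sorted(adjacency):
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                neighbours = adjacency[walk[-1]]
                if not neighbours:  # dead end: stop this walk early
                    break
                walk.append(rng.choice(neighbours))
            walks.append(walk)
    return walks


# Toy spatial graph with three spots in a line (hypothetical labels).
toy_graph = {"s1": ["s2"], "s2": ["s1", "s3"], "s3": ["s2"]}
toy_walks = random_walks(toy_graph, walk_length=5, walks_per_node=2)
```

Each walk can then be passed to a tokenizer as if it were a sentence, with node ids playing the role of words.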

4

ProtAug: An Empirical Investigation of pLM-Guided Data Augmentation for Protein Sequence Prediction Tasks

Chen, Z.; Wang, R.; Luo, Q.

2026-07-11 bioinformatics 10.64898/2026.07.10.737545 medRxiv

Top 0.2%

4.3%

Protein language models (pLMs) offer great potential for protein sequence analysis, yet the scarcity of labeled data often limits their effectiveness in fine-tuning. Data augmentation is a promising remedy, but systematic evaluation of augmentation strategies for protein sequences remains limited, and the conditions under which augmentation confers downstream benefits are not well understood. In this paper, we systematically investigate pLM-guided substitution-based augmentation across seven protein prediction tasks. We propose ProtAug, a framework that leverages encoder-based (ESM-2) and autoregressive (ProtGPT2) pLMs to generate augmented sequences with user-controlled variation levels. Our investigation focuses on four questions: (Q1) whether pLM-synthesized sequences preserve more original signals than simpler methods, (Q2) to what extent augmentation improves prediction performance, (Q3) how variation levels affect downstream accuracy across tasks and models, and (Q4) whether biological plausibility is a necessary condition for achieving improvement. Our experimental results show that: (1) ProtAug Esm generally preserves motifs and structural similarity better than simple substitution, often comparable to homology retrieval; (2) augmentation yields consistent but task-dependent improvements, with ProtAug Esm achieving the best or second-best performance in 5 out of 7 tasks at 10% variation; (3) low-to-moderate variation levels (2-30%) perform best overall, although high-variation augmentation can benefit certain structure-related tasks; (4) the necessity of biological plausibility is task- and variation-dependent--while semantic preservation correlates with performance at low-to-moderate variation levels, improved generalization at high variation levels suggests that regularization effects, rather than label preservation, can also drive performance gains.

5

DDTRN: Predicting Bacterial Transcriptional Regulatory Networks Based on Gene Sequences using Dual Descriptor

Nie, P.; Ma, B.-G.

2026-07-01 bioinformatics 10.64898/2026.06.30.735580 medRxiv

Top 0.2%

4.3%

Accurate computational reconstruction of bacterial transcriptional regulatory networks (TRNs) from sequence information alone remains a fundamental challenge in systems biology, particularly for non-model organisms lacking extensive transcriptomic data. We present DDTRN, a sequence-driven framework that formulates TRN inference as a binary classification task over concatenated regulator-target gene sequence pairs and employs a Dual Descriptor (DD) model to predict regulatory interactions. The DD architecture represents a sequence through two learnable components: a Composition Weight Map (CWM) and a Position Weight Function (PWF). We comprehensively evaluate DDTRN against six conventional machine learning baselines across eight benchmark bacterial datasets, including E. coli (DREAM5, RegulonDB), B. subtilis, S. enterica, C. glutamicum, M. tuberculosis, P. aeruginosa, and S. coelicolor. DDTRN achieves superior overall performance, attaining average AUROC and AUPR scores of 0.869 and 0.868, respectively, with particularly pronounced advantages at lower descriptor ranks where positional weighting compensates for limited sequence context. Systematic sensitivity analyses of rank, embedding dimension, and basis function count reveal stable optimal operating regimes, while subsampling experiments demonstrate strong robustness even with limited training data. Interpretability analyses show that PWF learns distinct periodic contributions across different rank granularities and that CWM preferentially weights meaningful k-mers. A case study on an E. coli dataset further illustrates that DDTRN identifies method-specific candidate targets complementary to those proposed by conventional approaches.
By operating solely on genomic sequence, DDTRN provides a scalable, interpretable, and data-efficient framework for bacterial TRN inference in species where expression data are scarce, and it establishes a foundation for future multimodal integration with condition-specific regulatory information.

6

Does ensembling improve feature attributions from sequence-to-activity models?

Maslova, A.; Libbrecht, M.

2026-07-13 bioinformatics 10.64898/2026.07.08.737315 medRxiv

Top 0.3%

3.4%

Sequence-to-activity (S2A) models take DNA sequence as input and predict genomic activities such as transcription factor binding and gene expression. Applying explainable AI (xAI) methods such as DeepLIFT to these models has recently led to breakthroughs on many genomic problems, including transcription factor binding grammar and predicting effects of genetic variants. However, there remains significant uncertainty about the reliability of S2A interpretations. Thus, we need accurate probabilistic measures of confidence to distinguish reliable from unreliable interpretations. Towards this end, researchers have recently aimed to characterize variability across ensembles of S2A models. However, previous work has focused on using model ensembles to improve the model predictions themselves. Here, we aim to evaluate whether model ensembles can also be used to improve feature attributions from post-hoc xAI methods. We find that ensembling attributions from multiple models improves downstream applications, including identifying transcription factor motifs and predicting regulatory genetic variants. We show that forming an ensemble using Monte Carlo Dropout (MCDropout) comes close to, but does not match, the performance of training multiple models, at much lower train-time computational cost.
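
As a rough illustration of what "ensembling attributions" could look like, the sketch below averages per-position attribution scores across models and reports their spread. Simple averaging is an assumption made here for illustration; the abstract does not state how attributions are combined, and `ensemble_attributions` is a hypothetical helper.

```python
from statistics import mean, pstdev


def ensemble_attributions(attribution_maps):
    """Combine per-position attribution scores from several models.

    attribution_maps: list of equal-length lists, one list per model.
    Returns (per-position mean, per-position population std. deviation).
    """
    lengths = {len(m) for m in attribution_maps}
    if len(lengths) != 1:
        raise ValueError("all attribution maps must have the same length")
    columns = list(zip(*attribution_maps))  # group scores by position
    return [mean(c) for c in columns], [pstdev(c) for c in columns]


# Three hypothetical models scoring a 4-bp window.
maps = [
    [0.9, 0.1, 0.0, 0.4],
    [0.7, 0.3, 0.0, -0.2],
    [0.8, 0.2, 0.0, 0.1],
]
avg, spread = ensemble_attributions(maps)
```

Positions with a high mean and a low spread are the ones the models agree on; a large spread flags attributions that depend on the individual model.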

7

Beyond infinite sites: Generalized ABBA-BABA statistic for deeper phylogenies

Zhang, C.; Nielsen, R.

2026-07-08 bioinformatics 10.64898/2026.07.06.736715 medRxiv

Top 0.3%

3.2%

Patterson's D statistic detects gene flow from ABBA-BABA site patterns, but its biallelic site patterns fail under deeper divergences, where multiple hits cause false positives. We propose two extensions, D+ and D*. Both incorporate multiallelic site patterns to reduce saturation bias under the JC and F84 models. Simulations show that D+ and D* both remain correctly null under all conditions and detect gene flow effectively, with distinct advantages: D+ guarantees non-negativity of the denominator, while D* provides greater robustness when mutation rates vary across genomic regions. The source code and binary files are publicly available at https://github.com/chaoszhang/ASTER.
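
For orientation, the classic biallelic statistic that the abstract starts from is D = (nABBA - nBABA) / (nABBA + nBABA), counted over a four-taxon alignment (P1, P2, P3, outgroup). The sketch below implements only that baseline; it is not D+ or D*, whose definitions are not given here, and the function name is hypothetical.

```python
def patterson_d(p1, p2, p3, outgroup):
    """Classic biallelic Patterson's D from four aligned sequences.

    Sites that are invariant, multiallelic, or contain non-ACGT symbols
    are skipped, which is exactly the restriction the extensions relax.
    """
    abba = baba = 0
    for a, b, c, o in zip(p1, p2, p3, outgroup):
        if any(x not in "ACGT" for x in (a, b, c, o)):
            continue  # gaps or ambiguous bases
        if len({a, b, c, o}) != 2:
            continue  # keep strictly biallelic sites only
        if a == o and b == c:
            abba += 1  # pattern ABBA: P2 and P3 share the derived allele
        elif b == o and a == c:
            baba += 1  # pattern BABA: P1 and P3 share the derived allele
    total = abba + baba
    return float("nan") if total == 0 else (abba - baba) / total


# Two ABBA sites, one BABA site, one invariant site.
d = patterson_d("AACA", "CCAA", "CCCA", "AAAA")
```

Here `d` is (2 - 1) / 3; values near zero are expected without gene flow, which is the null the abstract refers to.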

8

Binary search and set operations on compacted k-mer lists

Dufresne, Y.; Andreace, F.

2026-07-03 bioinformatics 10.64898/2026.06.29.735436 medRxiv

Top 0.3%

3.0%

Sorted lists of elements are particularly good for computing set operations. A single scan of the two lists is sufficient to materialize or count the results of the union, intersection, difference, and xor operators. In bioinformatics, only a few tools are designed to perform these operations on k-mers. A fast tool like KMC allows set operations at the cost of storing individual k-mers. In this paper, we introduce a novel way to represent sorted k-mers as a collection of recomposed super-k-mer sorted lists. We introduce the concept of virtual super-k-mer and show how to construct, query and perform set operations on sorted lists of virtual super-k-mers. In the implementation sklib, we demonstrate high throughput of the data structure for construction and set operations, while remaining competitive in query capabilities, within a controlled memory footprint (2-5x decrease in bits/element compared to KMC).
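
The single-scan property mentioned at the start of the abstract can be sketched on plain sorted k-mer lists. This illustrates only the generic merge-style scan, not the virtual super-k-mer representation or the sklib API; `merge_set_ops` is a hypothetical name and the inputs are assumed sorted and duplicate-free.

```python
def merge_set_ops(a, b):
    """One pass over two sorted, duplicate-free lists.

    Returns (union, intersection, difference a-b, symmetric difference).
    """
    union, inter, diff, xor = [], [], [], []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:  # element present in both lists
            union.append(a[i])
            inter.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:  # element only in a
            union.append(a[i])
            diff.append(a[i])
            xor.append(a[i])
            i += 1
        else:  # element only in b
            union.append(b[j])
            xor.append(b[j])
            j += 1
    for x in a[i:]:  # leftovers of a are absent from b
        union.append(x)
        diff.append(x)
        xor.append(x)
    for x in b[j:]:  # leftovers of b are absent from a
        union.append(x)
        xor.append(x)
    return union, inter, diff, xor


u, n, d, x = merge_set_ops(["AAC", "ACG", "CGT"], ["ACG", "GGA"])
```

The scan touches each element once, so all four results cost O(len(a) + len(b)) comparisons.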

9

GCBM-DCT-HV-Bio-NL-Grow-CHG-CSM-RHEC: A Unified Geometric, Biological, Causal, and Regenerative Framework for Mechanism-Aware Tissue and Connectome Modeling

Xu, T.; Hu, Z.; Sun, X.; Jin, L.; Xiong, M.

2026-06-29 bioinformatics 10.64898/2026.06.24.734320 medRxiv

Top 0.4%

2.5%

Modern biological prediction problems increasingly require models that go beyond Euclidean feature regression and local graph smoothing. Tissue, cellular, and connectome systems are nonlinear, geometry-dependent, intervention-sensitive, history-dependent, and subject to regenerative or homeostatic constraints. We propose GCBM/DCT/HV/Bio/NL/Grow/CHG/CSM/RHEC, a unified model for mechanism-aware biological prediction. The model integrates geometric connectome dynamics, differentiable charted tissue geometry, Hamiltonian latent transport, nonlinear biological kinetics, nested latent memory, continual growth without overwriting, causal hypergraph structure, causal structure modeling, and regenerative homeostatic error correction. Unlike Euclidean baselines, which treat observations as flat vectors, and local graph baselines, which use neighborhood smoothing without mechanistic structure, the proposed model represents biological states (Trapnell 2015) as coupled geometric, dynamical, causal, and regenerative objects. We evaluate the model on four synthetic toy studies, Toy A, B, C, and D, designed to reflect increasing biological complexity: local Euclidean structure, nonlinear mechano-chemical interaction, causal intervention response, and out-of-distribution regenerative shift. Compared with Euclidean and local graph baselines, the full model achieves the lowest mean squared error across all four toy studies. Relative to the Euclidean baseline, the full model reduces MSE by approximately 63.0%, 89.1%, 89.0%, and 90.9% on Toy A, Toy B, Toy C, and Toy D, respectively. These results support the value of integrating geometry, mechanism, causal structure, adaptive growth, and regenerative correction into a single predictive architecture (Figure 1).

10

SEMFA: A General Framework for Inferring Statistical Significance of Mahalanobis Similarity between Multi-Omics Profiled Samples Built on Multiple Factor Analysis

Han, J.; Luo, W.; Baldwin, E.; Zhang, H. H.; An, L.; Liu, J.; Li, H.

2026-06-24 bioinformatics 10.64898/2026.06.18.733287 medRxiv

Top 0.4%

2.3%

Motivation: With rapid advances in sequencing technologies, many heterogeneous omics datasets have been generated, as seen in the Encyclopedia of DNA Elements (ENCODE) and many single-cell multi-omics sequencing projects, bringing substantial challenges to existing integrative methods. In this article, we report a novel multi-omics fusion and analysis software SEMFA which performs general parametric tests for the Mahalanobis Similarity of samples based on the factor scores generated by an Extended version of conventional Multiple Factor Analysis. Results: Our developed method is effective and robust under both Gaussian and non-Gaussian assumptions. The mean F1 scores are over 0.8 when the column similarity level is 0.9 and the noise level ranges between 0.1 and 0.2, using simulation studies based on ENCODE count data. It was also efficient and effective at handling large-scale single-cell multi-omics data, as demonstrated in colon cancer cases as it unveiled signature network organization patterns of cells for stages III and IV.

11

An axiomatic approach to cultivar ranking in multi-environment trials

Kondratev, A. Y.; Ianovski, E.; Voronina, E.; Crossa, J.

2026-07-01 genetics 10.64898/2026.06.27.734959 medRxiv

Top 0.5%

2.1%

Multi-environment trials are central to cultivar evaluation because they reveal how candidate cultivars perform across locations, years, management conditions, and stress environments. The resulting yield matrix is a rich source of data on genotype-by-environment interaction, and a wide literature on estimation, decomposition, visualisation, and prediction of yield potential and stability has flourished. However, the ultimate question of which cultivar to recommend on the basis of such a matrix is often left implicit. The question is far from trivial, and in this paper we formulate cultivar recommendation as an axiomatic ranking problem. This framework is rich enough to encompass the existing literature on stability indices, as well as any other deterministic ranking procedure. We show that many commonly used stability-based procedures can violate minimal criteria of efficiency or consistency. The result of such violations is that a cultivar with uniformly high yield could be ranked below a cultivar with uniformly low yield, or the relative ranks of two cultivars could depend on whether or not a third cultivar is present in the matrix. Our results prove that under a small number of such criteria the space of admissible rules collapses to the family of power means and their limiting cases. If we further wish to allow multiplicative normalisation of yield, we are left with the geometric mean as the unique solution.
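
A small worked example may help with the last two sentences. The sketch below evaluates power means of a cultivar's yields and checks how rankings react when one environment's yields are rescaled by a constant. Reading "multiplicative normalisation" as per-environment rescaling is an assumption made here, and the cultivar names and yield values are made up.

```python
import math


def power_mean(yields, p):
    """Power mean of positive yields; p = 0 is the geometric-mean limit."""
    n = len(yields)
    if p == 0:
        return math.exp(sum(math.log(y) for y in yields) / n)
    return (sum(y ** p for y in yields) / n) ** (1.0 / p)


def ranking(matrix, p):
    """Cultivar names ordered from best to worst by power mean."""
    return sorted(matrix, key=lambda name: power_mean(matrix[name], p), reverse=True)


# Hypothetical yields of two cultivars in two environments.
trial = {"X": [8.0, 1.0], "Y": [3.0, 3.0]}
# Same trial with environment 2 expressed in units ten times smaller.
rescaled = {name: [ys[0], 10.0 * ys[1]] for name, ys in trial.items()}

arithmetic_before = ranking(trial, 1)     # X first (4.5 vs 3.0)
arithmetic_after = ranking(rescaled, 1)   # Y first (9.0 vs 16.5)
geometric_before = ranking(trial, 0)      # Y first (sqrt(8) vs 3)
geometric_after = ranking(rescaled, 0)    # Y first (sqrt(80) vs sqrt(90))
```

Rescaling one environment multiplies every cultivar's geometric mean by the same factor, so the geometric ranking cannot change, whereas the arithmetic ranking flips in this example.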

12

Gene-specific exponent-corrected normalization for library size in bulk RNA-seq

Yin, R.; Li, D.; Zong, W.; Ketchesin, K. D.; Seney, M. L.; McClung, C. A.; Baldoni, P. L.; Tseng, G. C.

2026-07-09 bioinformatics 10.64898/2026.07.04.736167 medRxiv

Top 0.5%

2.1%

Correcting for library size is an essential step in bulk RNA-seq analyses, as differences in sequencing depth across samples can obscure biological signal with technical noise. While numerous normalization methods and model-based strategies have been proposed, we demonstrate here that library size-normalized counts and differential expression results obtained from such widely adopted approaches often remain strongly correlated with library size in large-scale RNA-seq experiments. Through a systematic analysis of over 100 publicly available GEO and TCGA RNA-seq datasets with raw count data, we show that library size association is observed for a substantial proportion of genes even after state-of-the-art library size correction approaches recommended by leading normalization tools. To address this issue, we propose gecco, a gene-specific exponent-corrected normalization method for RNA-seq counts that incorporates library size directly into the statistical framework via a gene-specific correction term, rather than applying a uniform adjustment factor across all genes. This formulation generalizes existing normalization approaches and yields normalized counts that are free of residual library size effects. Using both simulation studies and real large-scale RNA-seq datasets, we show that our method mitigates library size bias while preserving biological signal across a range of parameter settings. We further demonstrate that our approach leads to higher detection accuracy and more biologically meaningful pathway enrichment results in downstream differential expression and rhythmicity analyses without compromising false discovery rate control. Our method is implemented in R and is fully compatible with the widely used differential expression analysis methods DESeq2 and edgeR.

13

A foundation model enables prediction of natural product molecular properties, bioactivity, and structural similarity from biosynthetic gene cluster sequence

Walker, A.

2026-07-07 bioinformatics 10.64898/2026.07.05.736569 medRxiv

Top 0.5%

1.9%

Genome mining is a powerful technique in natural product discovery, where biosynthetic gene clusters that are likely to produce novel or desirable natural products are identified through bioinformatic analysis. There are many more predicted biosynthetic gene clusters than can easily be experimentally characterized. Additional computational methods to prioritize biosynthetic gene clusters by the bioactivity, structural properties, or novelty of the product would make genome mining more efficient. Multiple machine learning/artificial intelligence models have been developed to predict product properties from biosynthetic gene cluster sequence, but they are limited by small quantities of training data. Model pretraining with unlabeled data is a powerful technique to develop models that can learn on a limited amount of labeled training data. Biosynthetic gene clusters are well suited to this strategy because there are many predicted clusters with only a small percentage being characterized. This paper reports BGC-MLM, a foundation model that is pretrained with a masked language task on predicted biosynthetic gene clusters and then fine-tuned for downstream applications including prediction of product structural class, bioactivity, chemical properties, counts of functional groups, and chemical fingerprint. Comparison to a model trained without pretraining shows that pretraining generally improves performance. BGC-MLM shows better or similar performance to existing specialized methods for these tasks, demonstrating its utility as a foundation model for natural product genome mining.

14

Metagenomic contextualization of proteins with state space models

Azbijari, N.; Wynne, J. H.; David, M.; Thurber, A. R.

2026-07-11 bioinformatics 10.64898/2026.07.07.736993 medRxiv

Top 0.5%

1.9%

Since the early adoption of metagenomics (the culture-free sequencing of microbial community genomes) in 2011, sequence data has increased over 500-fold across ecosystems. This surge in data has outpaced reliable taxonomic and functional annotation, with over half of sequences lacking confident functional assignment. These unknown sequences limit our understanding of microbial processes central to planetary health and human health. Recent advances in genomic language modeling have made progress in the interpretation of metagenomics datasets. Most state-of-the-art models rely on transformer architectures, which limit the maximum sequence length and therefore capture only a fraction of assembled metagenomic sequences due to the quadratic scaling of attention. This prevents training and inference on sequences with broad context, including multiple coding and non-coding regions. To overcome this limitation, we propose leveraging new model architectures that scale linearly with sequence length, making them more suitable for modeling longer metagenomic sequences. Here, we introduce Nammu, a mixed-modality Mamba-based foundation model with 167M parameters trained on the OpenMetaGenomic (OMG) corpus. Nammu is a bidirectional encoder trained with a 20K context length using a curriculum strategy, first on 64M protein sequences and then on 32M mixed-modality metagenomic contigs. We compared Nammu to gLM2, a mixed-modality transformer also trained on OMG using 37% more tokens, using taxonomy inference on a marine dataset from the Critical Assessment of Metagenome Interpretation (CAMI). Nammu outperforms gLM2 at every taxonomic level. We further assessed function via KEGG Orthology prediction in deep-sea metagenome-assembled genomes, where Nammu outperforms gLM2 (150M). These results demonstrate improved performance.

15

Hidden sampling biases inflate performance in gene regulatory network inference

Stock, M.; Ratajczak, F.; Bertin, P.; Hoermanseder, E.; Bengio, Y.; Hartford, J.; Falter-Braun, P.; Heinig, M.; Tong, A.; Scialdone, A.

2026-07-14 bioinformatics 10.64898/2025.12.19.695616 medRxiv

Top 0.5%

1.9%

Accurate reconstruction of gene regulatory networks (GRNs) from single-cell transcriptomic data remains a major methodological challenge. Recent machine learning approaches, particularly graph neural networks and graph autoencoders, have reported improved performance, yet these gains do not consistently translate to realistic biological settings. Here, we show that a key reason for that is the way negative regulatory interactions are sampled for supervised training and evaluation. We find that widely used sampling strategies introduce node-degree biases that allow models to exploit trivial graph-structural cues rather than biological signals. Across multiple benchmarks, simple degree-based heuristics match or exceed state-of-the-art graph neural network models under these biased evaluation protocols. We further introduce a degree-aware sampling approach that eliminates these artifacts and provides more reliable assessments of GRN inference methods. Our results call for standardized, bias-aware benchmarking practices to ensure meaningful progress in supervised GRN inference from single-cell RNA-seq data.
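
To make the degree-bias argument concrete, the sketch below scores candidate regulator-target pairs using nothing but node degrees in the training edges. The additive score is one possible degree-based heuristic chosen for illustration; the abstract does not specify the heuristics or the degree-aware sampler used, and the gene names are hypothetical.

```python
from collections import Counter


def degree_scores(train_edges, candidates):
    """Score each (regulator, target) pair from training-set degrees only.

    score = out-degree of the regulator + in-degree of the target.
    No expression data is used, so any apparent skill is purely structural.
    """
    out_deg = Counter(u for u, _ in train_edges)
    in_deg = Counter(v for _, v in train_edges)
    return {(u, v): out_deg[u] + in_deg[v] for u, v in candidates}


train = [("TF1", "g1"), ("TF1", "g2"), ("TF1", "g3"), ("TF2", "g1")]
candidates = [
    ("TF1", "g4"),  # held-out pair whose regulator is a hub
    ("TF2", "g2"),  # held-out pair with moderate degrees
    ("g3", "g4"),   # uniformly sampled negative: source never regulates
]
scores = degree_scores(train, candidates)
```

In this toy case the uniformly sampled negative scores 0 while both pairs involving known regulators score higher, so a model can separate them without learning any biology, which is the artifact the abstract describes.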

16

Model-based inference of gene expression noise from single-cell RNA-sequencing data

Giersdorf, F.; Rogers, D. W.; Christensen, S.; Dutheil, J. Y.

2026-06-23 bioinformatics 10.64898/2026.06.18.733122 medRxiv

Top 0.6%

1.7%

The heterogeneity of expression levels among genetically identical cells, termed gene expression noise, is a property of the gene expression process whose importance in the biology of organisms and their evolution is increasingly recognized. Measuring gene expression noise requires single-cell expression data, as obtained from single-cell RNA sequencing (scRNASeq). Its estimation, however, is challenging owing to (i) the presence of technical noise in addition to biological noise, and (ii) the heterogeneity of cell types in the sampled population. We propose a maximum-likelihood framework to infer biological noise from scRNASeq data, while accounting for technical noise, dropout probabilities, and distinct cell sequencing depths. Using simulations, we demonstrate parameter identifiability and show that the resulting noise estimates are uncorrelated with the mean gene expression, and therefore do not need extra correction in downstream analyses, easing intra- and inter-genome comparisons. Using two technical replicates of scRNASeq data from the wild yeast Saccharomyces paradoxus, we show that expression noise can be inferred in a reproducible manner.

17

Interspecies Differential Gene Expression Analysis with Regularized Phylogenetic Linear Models

Gallopin, M.; Daunesse, M.; Lespinet, O.; Liehrmann, A.; Bastide, P.

2026-07-03 evolutionary biology 10.64898/2026.06.30.734542 medRxiv

Top 0.6%

1.7%

Comparative transcriptomic datasets are increasingly used to investigate the molecular basis of phenotypic diversification across species. However, finding genes that are differentially expressed (DE) between lineages remains challenging, for two main reasons. First, the random evolutionary drift can blur the signal left by lineage-specific shifts in mean expression, and induces phylogenetic correlations that, if ignored, can widely inflate the False Discovery Rate (FDR), i.e., the amount of spuriously detected genes. Second, DE analysis from RNA-Seq data involves multiple testing on many genes for a small number of individual measurements with high noise, and requires dedicated statistical tools. Traditional DE tools, such as limma, and classical Phylogenetic Comparative Methods (PCMs), such as the Expression Variance and Evolution (EVE) model, are both designed to tackle one of these two challenges alone, but both fail in the context of inter-species RNA-Seq data. In this work, we present phyloDE, a new tool for inter-species DE, that aims at taking the best from both approaches. On simulations based on a recently published four-species rodent dataset, we show that, contrary to other methods, phyloDE correctly controls the FDR in all settings, while keeping a reasonable power. When reanalyzing the empirical dataset, phyloDE discovers more DE genes that exhibit consistent changes in their cis-regulatory landscape compared to EVE in all the experimental settings. The method is implemented in R, with an interface inheriting from limma.

18

Why linkage disequilibrium measures disagree: Fisher geometry of rare-common haplotype structure

Ichikawa, Y.

2026-07-07 genetics 10.64898/2026.07.02.736022 medRxiv

Top 0.7%

1.5%

Conventional LD measures such as r² perform poorly in the rare-common regime, particularly in asymmetric configurations such as nested haplotype structure. Because r² is symmetric and quadratic, it removes directional structure in two ways: squaring discards the sign, or phase, retained by the signed LD coefficient D, while symmetric normalization hides the asymmetry between the conditional probabilities P(A|B) and P(B|A). Although D recovers the phase, it is locus-symmetric and unnormalized; its magnitude is hard to compare across frequency regimes and it does not by itself express which way the asymmetry runs. We therefore analyze the conditional-probability asymmetry Δ = P(A|B) - P(B|A), together with r² and D, as distinct scalar functions on the haplotype simplex under the Fisher information metric. The conditional probabilities P(A|B) and P(B|A) are bounded in [0, 1], directly express carrier-set inclusion, and are more readily visualized than D. Moreover, their difference admits the exact decomposition Δ = M + C into a marginal frequency term M and an LD-coupled term C. Prior work has characterized either the mathematical behavior of LD normalizations across allele-frequency space or the Fisher geometry of the haplotype simplex, but not their connection. We bridge this gap by showing that the geometric structure of the simplex explains why LD measures disagree in the rare-common regime and why symmetric normalizations such as r² lose directional information. We show that the fixed-frequency leaf is intrinsically anisotropic, positively curved, and frequency-dependent under the Fisher metric. These geometric predictions are tested empirically, in phased 1000 Genomes data and a two-locus Wright-Fisher model, in a companion paper (Ichikawa, preprint); the present note develops the geometry itself.
Keywords: linkage disequilibrium; Fisher information metric; haplotype simplex; rare variant; conditional-probability asymmetry; nested haplotype structure
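
To make the contrast between these measures concrete, the following minimal Python sketch computes D, r², and Δ from two-locus haplotype frequencies. The abstract does not define M and C, so the split used here is an assumption: substituting P(AB) = p_A p_B + D into the two conditional probabilities gives M = p_A - p_B and C = D (p_A - p_B) / (p_A p_B), which is one algebraically exact way to write Δ = M + C and may differ from the paper's own definition. The function name `ld_summary` and its arguments are hypothetical.

```python
def ld_summary(p_ab, p_a, p_b):
    """Summarize LD for two biallelic loci.

    p_ab: frequency of the haplotype carrying both allele A and allele B
    p_a, p_b: marginal frequencies of alleles A and B (each strictly in (0, 1))
    """
    d = p_ab - p_a * p_b                                 # signed LD coefficient D
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))     # squared correlation
    a_given_b = p_ab / p_b                               # P(A|B)
    b_given_a = p_ab / p_a                               # P(B|A)
    delta = a_given_b - b_given_a                        # conditional-probability asymmetry
    # Assumed decomposition (not taken from the paper): delta = M + C
    m = p_a - p_b                                        # marginal frequency term
    c = d * (p_a - p_b) / (p_a * p_b)                    # LD-coupled term
    return {"D": d, "r2": r2, "P(A|B)": a_given_b,
            "P(B|A)": b_given_a, "delta": delta, "M": m, "C": c}


# Rare allele A (1%) fully nested inside common allele B (50%):
# every A carrier also carries B, so P(B|A) = 1.
nested = ld_summary(p_ab=0.01, p_a=0.01, p_b=0.5)
print(nested)
```

In this nested configuration r² is about 0.01 even though carrier-set inclusion is complete, while Δ is about -0.98 and its sign indicates that A is the nested allele.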

19

synpact: accurate, memory-light PacBio HiFi read mapping via a hierarchy of locally-consistent syncmer blocks

Aydin, M. S.; Sahlin, K.

2026-07-02 bioinformatics 10.64898/2026.06.28.735066 medRxiv

Top 0.7%

1.4%

Motivation: Mapping PacBio HiFi reads is a routine task and a central step in many bioinformatics analyses. However, the most accurate long-read mappers are slow and consume a lot of memory. Some lightweight mappers have been proposed for faster runtime, but their accuracy is not comparable to that of state-of-the-art mappers. With the increasing number of available reference sequences, memory-efficient and fast read-mapping methods that avoid a large drop in accuracy are desired. A general limitation of seed-chain-extend mappers is that they select a single, fixed seed size, which forces a compromise between sensitivity and specificity. Results: We present synpact, a long-read mapper that uses several seed sizes (a hierarchy) constructed with Locally Consistent Parsing (LCP) over syncmers. A read is mapped by querying for matches at different levels, followed by sliding-window voting. By storing only the coarse upper levels rather than the full hierarchy, the index holds several times fewer entries, while still handling errors by falling back from coarser to finer stored levels at query time. We benchmark synpact against popular long-read mappers on four genomes and different read lengths. For simulated PacBio HiFi data, synpact matches or approaches minimap2 accuracy with higher precision in most cases, while using roughly 5-13 times less peak memory (e.g., about 0.8 GB vs. 10.7 GB on human) and mapping faster on large or repetitive genomes (e.g., about 10 to 13 times faster than minimap2 on rye). On real HiFi reads, synpact has high concordance with minimap2 across the four genomes, unlike the other lightweight long-read mappers. Availability and Implementation: synpact is written in Rust and is available at https://github.com/mahmudsami/synpact
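
For readers unfamiliar with syncmers, the sketch below illustrates closed-syncmer selection in Python. It is not synpact's implementation (which is in Rust) and does not cover the LCP hierarchy; the abstract does not say which syncmer variant or ordering synpact uses, so the closed variant, the lexicographic ordering of s-mers, and the function name `closed_syncmers` are all assumptions made for illustration.

```python
def closed_syncmers(seq, k, s):
    """Return (position, k-mer) pairs for the closed syncmers of seq.

    A k-mer is kept if its smallest s-mer (lexicographic order, an
    assumption here; practical tools usually use a hash) sits in the
    first or last s-mer slot of the k-mer.
    """
    selected = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        smers = [kmer[j:j + s] for j in range(k - s + 1)]
        smallest = min(smers)
        # Ties are handled permissively: any occurrence of the minimum
        # at either end qualifies the k-mer.
        if smers[0] == smallest or smers[-1] == smallest:
            selected.append((i, kmer))
    return selected


for pos, kmer in closed_syncmers("ACGTACGT", k=5, s=2):
    print(pos, kmer)
```

Because selection depends only on the content of each k-mer, the same k-mer is chosen wherever it occurs in a read or a reference, which is what makes syncmers usable as seeds.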

20

An evaluation of clustering and assembly strategies from Iso-Seq data in the absence of reference genomes in non-model animals

Eleftheriadi, K.; Vazquez-Valls, M.; Fernandez, R.

2026-07-08 evolutionary biology 10.1101/2025.09.18.677004 medRxiv

Top 0.8%

1.3%

Transcriptome assembly enables the recovery of expressed genes and isoforms, but the optimal strategy for reconstructing transcriptomes from long-read sequencing remains unresolved. In particular, establishing best practices for generating accurate gene models and selecting representative isoforms is essential for comparative genomics, because orthology inference typically includes only the longest isoform per gene model. Here, we systematically compare clustering and de novo assembly methods using PacBio Iso-Seq data from 17 animal lineages spanning seven phyla, most of them non-model species, with the goal of investigating which methodology is best suited to selecting one isoform per gene model, in the absence of specific pipelines to do so. We evaluate four approaches: isoseq cluster, CD-HIT, RNA-Bloom2 and isONform. We benchmark them against short-read assemblies generated with Trinity, assessing assembly quality with BUSCO completeness, short-read mapping rates, coding sequence recovery, and longest isoform prediction. Our results show that CD-HIT clustering at high similarity thresholds (≥99%) yields the most complete and coding-rich long-read transcriptomes, rivaling Trinity while avoiding its high redundancy. Consensus-based methods such as isoseq cluster and isONform recover fewer single-copy orthologs (mirrored in a lower BUSCO score) and achieve lower mapping rates, while RNA-Bloom2 provides intermediate performance with reduced duplication. Together, these findings establish CD-HIT as the most robust and practical strategy evaluated to date for transcriptome reconstruction from long-read data when genomic references are unavailable. By benchmarking de novo methods across a taxonomically broad dataset, this work defines the realistic capabilities of long-read transcriptome reconstruction in the absence of a reference genome and provides practical guidance for deriving high-quality gene models and selecting representative isoforms for orthology inference in non-model species.